Abstract. Galled networks, directed acyclic graphs that model evolutionary histories
with reticulation cycles containing only tree nodes, have become very popular due to
both their biological significance and the existence of polynomial-time algorithms for
their reconstruction. In this paper we prove that Nakhleh's measure m is a metric for
this class of phylogenetic networks, and hence it can be safely used to evaluate galled
network reconstruction methods.

1 Introduction 

Phylogenetic networks have been studied over the last years as a richer model of the 
evolutionary history of sets of organisms than phylogenetic trees, because they take 
into account not only mutation events but also reticulation events, like recombinations, 
hybridizations, and lateral gene transfers. Technically, this is accomplished by modifying
the concept of phylogenetic tree in order to allow the existence of nodes with in-degree
greater than one. As a consequence, much progress has been made in finding practical
algorithms for reconstructing a phylogenetic network from a set of sequences or other
types of evolutionary information. Since different reconstruction methods applied to the
same sequences, or a single method applied to different sequences, may yield different
phylogenetic networks for a given set of species, a sound measure to compare
phylogenetic networks becomes necessary [11]. The comparison of phylogenetic networks is
also needed in the assessment of phylogenetic reconstruction methods [10], and it will 
be required to perform queries on future databases of phylogenetic networks [14]. 

Several distances for the comparison of phylogenetic networks have been proposed
so far in the literature, including generalizations to networks of the Robinson-Foulds
distance for trees, like the tripartitions distance [11] and the $\mu$-distance [1,6], and
different types of nodal distances [2,5]. None of the polynomial-time computable
distances for phylogenetic networks introduced up to now separates arbitrary
phylogenetic networks; that is, zero distance does not imply isomorphism in general.
Of course, this is consistent with the equivalence between the isomorphism problems
for phylogenetic networks and for graphs, and the general belief that the latter lies in
$\mathrm{NP} \setminus \mathrm{P}$. Therefore one has to study for which interesting
classes of phylogenetic networks these distances are metrics in the precise mathematical
sense of the term. The interest of the classes under study may stem from their biological
significance, or from the existence of reconstruction algorithms.
This work contributes to this line of research. We prove that a distance introduced
recently by Nakhleh [12] separates semibinary galled networks (roughly speaking,
networks where every node of in-degree greater than one has in-degree exactly two and
every reticulation cycle has only one hybrid node; see the next section for the exact
definition, [7, 8] for a discussion of the biological meaning of this condition, and [9, 13]
for reconstruction algorithms). In this way, this distance turns out to be the only
non-trivial metric available so far on this class of networks that is computable in
polynomial time.

2 Preliminaries 

Given a set $S$ of labels, an $S$-DAG is a directed acyclic graph with its leaves bijectively
labelled by $S$. In an $S$-DAG, we shall always identify without any further reference every
leaf with its label.

Let $N = (V,E)$ be an $S$-DAG. A node is a leaf if it has out-degree 0 and internal
otherwise, a root if it has in-degree 0, of tree type if its in-degree is $\leq 1$, and of hybrid
type if its in-degree is $> 1$. $N$ is rooted when it has a single root. A node $v$ is a child
of another node $u$ (and hence $u$ is a parent of $v$) if $(u,v) \in E$. Two nodes with a parent
in common are siblings of each other. A node $v$ is a descendant of a node $u$ when there
exists a path from $u$ to $v$; we shall also say in this case that $u$ is an ancestor of $v$. The
height $h(v)$ of a node $v$ is the largest length of a path from $v$ to a leaf.

A phylogenetic network on a set $S$ of taxa is a rooted $S$-DAG such that no tree
node has out-degree 1 and every hybrid node has out-degree 1. A phylogenetic tree is
a phylogenetic network without hybrid nodes. A reticulation cycle in a phylogenetic
network is a pair of internally disjoint paths from a tree node (its source) to a hybrid
node (its target).
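The definitions above can be illustrated with a small Python sketch. The arc-list representation, the function names and the `example` network are assumptions made for this illustration and do not come from the paper; acyclicity is assumed rather than checked.

```python
from collections import defaultdict

def degrees(arcs):
    """In- and out-degree of every node of a DAG given as (parent, child) arcs."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
        indeg[u] += 0   # register u and v in both maps
        outdeg[v] += 0
    return indeg, outdeg

def height(arcs, v):
    """Largest length of a path from v to a leaf."""
    children = [c for p, c in arcs if p == v]
    return 0 if not children else 1 + max(height(arcs, c) for c in children)

def is_phylogenetic_network(arcs):
    """Single root, no tree node of out-degree 1, every hybrid node of out-degree 1.
    Acyclicity is assumed, not checked."""
    indeg, outdeg = degrees(arcs)
    if sum(1 for v in indeg if indeg[v] == 0) != 1:
        return False
    for v in indeg:
        hybrid = indeg[v] > 1
        # hybrid nodes must have out-degree 1, tree nodes must not
        if hybrid != (outdeg[v] == 1):
            return False
    return True

# Hypothetical network: leaves 1, 2, 3 and one hybrid node h with parents a and b.
example = [("r", "a"), ("r", "b"), ("a", "1"), ("a", "h"),
           ("b", "h"), ("b", "2"), ("h", "3")]
```

Here `is_phylogenetic_network(example)` holds, while a root with a single child would be rejected because it is a tree node of out-degree 1.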

The underlying biological motivation for these definitions is that tree nodes model 
species (either extant, the leaves, or non-extant, the internal tree nodes), while hybrid 
nodes model reticulation events. The parents of a hybrid node represent the species
involved in this event and its single child represents the resulting species (if it is a tree
node) or a new reticulation event in which this resulting species takes part without
yielding any other descendant (if the child is a hybrid node). The tree children of a
tree node represent direct descendants through mutation. The absence of out-degree 1
tree nodes in a phylogenetic network means that every non-extant species has at least
two different direct descendants. This is a very common restriction in any definition of
phylogeny, since species with only one child cannot be reconstructed from biological
data.

Many restrictions have been added to this definition. Let us introduce now some of 
them. For more information on these restrictions, including their biological or technical 
motivation, see the references accompanying them. 

- A phylogenetic network is semibinary if every hybrid node has in-degree 2 [1], and 
binary if it is semibinary and every internal tree node has out-degree 2. 

- A phylogenetic network is a galled network when every non-target node in every
reticulation cycle is a tree node [7, 8].



Two hybridization networks $N, N'$ are isomorphic, in symbols $N \cong N'$, when they
are isomorphic as directed graphs and the isomorphism sends each leaf of $N$ to the leaf
with the same label in $N'$.

3 On Nakhleh's distance m 

Let us recall the distance $m$ introduced by Nakhleh in [12], in the version described in
[3]. Let $N = (V,E)$ be a phylogenetic network on a set $S$ of taxa. For every node $v \in V$,
its nested label $\lambda_N(v)$ (or simply $\lambda(v)$ when there is no risk of confusion) is
defined recursively as follows:

- If $v$ is the leaf labelled $i$, then $\lambda_N(v) = \{i\}$.

- If $v$ is internal and all its children $v_1,\ldots,v_k$ have been already labelled, then
$\lambda_N(v)$ is the multiset $\{\lambda_N(v_1),\ldots,\lambda_N(v_k)\}$ of their labels.

The absence of cycles in $N$ entails that this labelling is well-defined.
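The recursive definition can be sketched in Python as follows. This is only an illustration under stated assumptions: networks are given as lists of (parent, child) arcs, each leaf is identified with its label, and a nested multiset is encoded as a tuple whose elements are sorted by their `repr`, so that equal multisets receive equal encodings. The function name and the `example` network are hypothetical.

```python
def nested_labels(arcs):
    """Map every node to its nested label, encoded as a canonically sorted tuple."""
    children = {}
    for u, v in arcs:
        children.setdefault(u, []).append(v)
        children.setdefault(v, [])
    labels = {}

    def label(v):
        if v not in labels:
            if not children[v]:
                # the leaf labelled i receives the label {i}
                labels[v] = (str(v),)
            else:
                # multiset of the children's labels, in a canonical order
                labels[v] = tuple(sorted((label(c) for c in children[v]), key=repr))
        return labels[v]

    for v in children:
        label(v)
    return labels

# Hypothetical network: leaves 1, 2, 3 and one hybrid node h with parents a and b.
example = [("r", "a"), ("r", "b"), ("a", "1"), ("a", "h"),
           ("b", "h"), ("b", "2"), ("h", "3")]
labels = nested_labels(example)
```

In this encoding the hybrid node `h` has label `(("3",),)`, that is, $\{\{3\}\}$, and this label occurs as an element of the labels of both of its parents.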

Notice that the nested label of a node is, in general, a nested multiset (a multiset of
multisets of multisets of ...), hence its name. Moreover, the height of a node $u$ is the
highest level of nesting of a leaf in $\lambda(u)$ minus 1.

The following result follows easily from the definition of nested labels.

Lemma 1. Let $N = (V,E)$ be a phylogenetic network on a set $S$ of taxa.

- If $(u,v) \in E$, then $\lambda(v) \in \lambda(u)$;

- If there is a path from $u$ to $v$, then there exists a sequence of nodes $u_1,\ldots,u_k$ such
that $u_1 = u$, $u_k = v$ and $\lambda(u_{i+1}) \in \lambda(u_i)$ for every $i = 1,\ldots,k-1$. □

The nested labels representation of $N$ is the multiset

$\lambda(N) = \{\lambda_N(v) \mid v \in V\}$,

where each nested label appears with multiplicity the number of nodes having it as
nested label. Nakhleh's distance $m$ between a pair of phylogenetic networks $N, N'$ on the
same set $S$ of taxa is then

$m(N,N') = |\lambda(N) \,\triangle\, \lambda(N')|$,

where the symmetric difference and the cardinality refer to multisets.
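A minimal Python sketch of this computation is given below, assuming networks are lists of (parent, child) arcs with leaves identified with their labels, and nested labels encoded as tuples sorted by `repr`. The names `nested_label_multiset`, `nakhleh_m` and the three small trees are hypothetical and serve only as an illustration.

```python
from collections import Counter

def nested_label_multiset(arcs):
    """Multiset of the nested labels of all nodes, as a Counter."""
    children = {}
    for u, v in arcs:
        children.setdefault(u, []).append(v)
        children.setdefault(v, [])
    memo = {}

    def label(v):
        if v not in memo:
            if not children[v]:
                memo[v] = (str(v),)
            else:
                memo[v] = tuple(sorted((label(c) for c in children[v]), key=repr))
        return memo[v]

    return Counter(label(v) for v in children)

def nakhleh_m(arcs1, arcs2):
    """Cardinality of the multiset symmetric difference of the two representations."""
    a, b = nested_label_multiset(arcs1), nested_label_multiset(arcs2)
    # Counter subtraction keeps positive counts only, so (a - b) + (b - a)
    # is the symmetric difference of the two multisets.
    return sum(((a - b) + (b - a)).values())

tree1 = [("r", "x"), ("r", "3"), ("x", "1"), ("x", "2")]   # ((1,2),3)
tree2 = [("r", "1"), ("r", "y"), ("y", "2"), ("y", "3")]   # (1,(2,3))
tree3 = [("s", "z"), ("s", "3"), ("z", "1"), ("z", "2")]   # ((1,2),3), renamed nodes
```

The two trees `tree1` and `tree2` share only the three leaf labels, so each contributes two labels to the symmetric difference, while `tree1` and `tree3` differ only in the names of internal nodes and are at distance zero.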

This distance trivially satisfies all axioms of metrics except, at most, the separation
axiom, and thus this is the key property that has to be checked on some class of networks
in order to guarantee that $m$ is a metric on it. So far, this distance $m$ is known to be a
metric for reduced networks [12], tree-child networks [3], and semibinary tree-sibling
time consistent networks [3] (always on any fixed set of labels $S$). It is not a metric for
arbitrary tree-sibling time consistent networks [3]. We prove here that it is a
metric for semibinary galled networks, which implies that it is also a metric for galled
trees and 1-nested networks.



4 The distance m for galled networks 



In this section we prove that the distance m defined above separates galled networks up 
to isomorphism. 

First of all, notice that if a galled network $N = (V,E)$ has no pair of different nodes
with the same nested label, then for every pair of nodes $u,v \in V$, we have that $(u,v) \in E$
iff $\lambda_N(v) \in \lambda_N(u)$. Indeed, on the one hand, the very definition of nested label entails
that if $(u,v) \in E$, then $\lambda_N(v) \in \lambda_N(u)$; and conversely, if $\lambda_N(v) \in \lambda_N(u)$, then $u$ has a
child $v'$ such that $\lambda_N(v') = \lambda_N(v)$, and by the injectivity of nested labels, it must happen
that $v = v'$.
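This observation translates directly into a reconstruction procedure, sketched here in Python under the assumption that nested labels are encoded as tuples (a leaf label $\{i\}$ as the 1-tuple `("i",)`) and that nodes are identified with their labels. The function name `arcs_from_labels` is hypothetical.

```python
def arcs_from_labels(labels):
    """Recover the arcs of a network whose nodes all have distinct nested labels.

    Nodes are identified with their labels; there is an arc (a, b)
    exactly when the label b is an element of the label a."""
    label_set = set(labels)
    if len(label_set) != len(labels):
        raise ValueError("nested labels must be pairwise different")
    return {(parent, child)
            for parent in label_set
            for child in parent
            if child in label_set}   # skips the bare leaf names inside leaf labels

# Labels of the tree ((1,2),3), written by hand.
one, two, three = ("1",), ("2",), ("3",)
x = (one, two)
r = (x, three)
recovered = arcs_from_labels([one, two, three, x, r])
```

For this tree, `recovered` consists of the four arcs from `r` to `x` and `three` and from `x` to `one` and `two`.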

This clearly implies that a galled network without any pair of different nodes with 
the same nested label can be reconstructed, up to isomorphisms, from its nested la- 
bels representation, and hence that non-isomorphic galled networks without any pair of 
different nodes with the same nested label always have different nested label represen- 
tations. Therefore, it remains to prove the separation axiom of Nakhleh's distance for 
galled networks with some pair of different nodes with the same nested label. 

The general result will be proved by algebraic induction on the number of pairs of
different nodes with the same nested label. To this end, we introduce a pair of reduction
procedures that decrease the number of pairs of different nodes with the same nested
label in a semibinary galled network. Each of these reductions, when applied to a galled
network with $n$ leaves and with at least one pair of different nodes with the same nested
label, produces a galled network with $n$ leaves and one pair fewer of different nodes with
the same nested label. Moreover, given any galled network with more than one leaf and
with at least one pair of different nodes with the same nested label, it is always possible
to apply to it one of these reductions.

(R) Let $N$ be a galled network, let $u \neq v$ be a pair of sibling nodes such that $\lambda(u) =
\lambda(v)$ and assume that $u$ and $v$ have the same children, which are the hybrid nodes
$h_1,\ldots,h_k$. The $R_{u,v;h_1,\ldots,h_k}$ reduction of $N$ is the network $R_{u,v;h_1,\ldots,h_k}(N)$ obtained by
removing the nodes $u,v,h_1,\ldots,h_k$, together with their incoming arcs, and adding an
arc from the parent of $u$ and $v$ to the child of each $h_i$, cf. Fig. 1.¹

(T) Let $N$ be a galled network, let $u \neq v$ be a pair of non-sibling nodes such that $\lambda(u) =
\lambda(v)$ and assume that $u$ and $v$ have the same children, which are the hybrid nodes
$h_1,\ldots,h_k$. Let $x,y$ be the parents of $u,v$ respectively, and notice that these nodes
must be of tree type, since otherwise $N$ would contain a reticulation cycle with
hybrid internal nodes. The $T_{u,v;h_1,\ldots,h_k}$ reduction of $N$ is the network $T_{u,v;h_1,\ldots,h_k}(N)$
obtained by removing the nodes $u,v,h_1,\ldots,h_k$, together with their incoming arcs,
and adding a hybrid node $h$ with a tree child $w$ and arcs from $x$ and $y$ to $h$, from $h$
to $w$, and from $w$ to the child of each $h_i$, cf. Fig. 2.
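The two reductions can be sketched on an arc-list representation as follows. This is an illustration only: the function names, the default names `"h*"` and `"w*"` for the new nodes of the T reduction, and the two small networks are assumptions of this sketch, and the equality of nested labels is implied here by requiring that $u$ and $v$ have exactly the same hybrid children.

```python
def _children(arcs, n):
    return sorted(c for p, c in arcs if p == n)

def _parents(arcs, n):
    return sorted(p for p, c in arcs if c == n)

def _shared_hybrids(arcs, u, v):
    """Common children of u and v, each required to have exactly the parents u, v."""
    hybrids = _children(arcs, u)
    ok = u != v and hybrids and hybrids == _children(arcs, v) and all(
        _parents(arcs, h) == sorted([u, v]) for h in hybrids)
    if not ok:
        raise ValueError("u and v must have the same hybrid children")
    return hybrids

def r_reduction(arcs, u, v):
    """R reduction: u and v are siblings; their parent adopts the children of the h_i."""
    hybrids = _shared_hybrids(arcs, u, v)
    pu = _parents(arcs, u)
    if pu != _parents(arcs, v) or len(pu) != 1:
        raise ValueError("u and v must be siblings")
    gone = {u, v, *hybrids}
    kept = [(p, c) for p, c in arcs if p not in gone and c not in gone]
    return kept + [(pu[0], w) for h in hybrids for w in _children(arcs, h)]

def t_reduction(arcs, u, v, h="h*", w="w*"):
    """T reduction: u and v have different parents x, y; add x->h, y->h, h->w, w->w_i."""
    hybrids = _shared_hybrids(arcs, u, v)
    px, py = _parents(arcs, u), _parents(arcs, v)
    if len(px) != 1 or len(py) != 1 or px == py:
        raise ValueError("u and v must have different single parents")
    gone = {u, v, *hybrids}
    kept = [(p, c) for p, c in arcs if p not in gone and c not in gone]
    new = [(px[0], h), (py[0], h), (h, w)]
    return kept + new + [(w, c) for hy in hybrids for c in _children(arcs, hy)]

# Hypothetical inputs: u and v share the hybrid children h1, h2 in both networks.
net_r = [("x", "u"), ("x", "v"), ("u", "h1"), ("u", "h2"),
         ("v", "h1"), ("v", "h2"), ("h1", "1"), ("h2", "2")]
net_t = [("r", "x"), ("r", "y"), ("x", "u"), ("x", "3"), ("y", "v"), ("y", "4"),
         ("u", "h1"), ("u", "h2"), ("v", "h1"), ("v", "h2"), ("h1", "1"), ("h2", "2")]
```

Applying `r_reduction(net_r, "u", "v")` leaves the single tree node `x` with the two leaves as children, while `t_reduction(net_t, "u", "v")` replaces the two reticulation cycles by a single hybrid node with parents `x` and `y`.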

Notice that in both cases the resulting network is a galled network since, in the
first case, we simply remove hybrid nodes, and in the second one, we simply replace $k$
hybridization cycles by only one, without adding any intermediate hybrid node. Also, in
both cases the number of pairs of different nodes with the same nested label decreases
by one. Last, we remark that in either case the nodes $w_1,\ldots,w_k$ (the children of
$h_1,\ldots,h_k$) have disjoint sets of descendants in the resulting network. Indeed, if two
different nodes $w_i$ and $w_j$ shared a descendant $y$, then there would exist a common hybrid
descendant $h$ and a reticulation cycle having $h$ as its target and $x$ as its source. This
cycle would induce a cycle in the original network that would contain hybrid nodes,
hence yielding a contradiction.

¹ In graphical representations of hybridization networks, we shall represent hybrid nodes by
squares, tree nodes by circles, and indeterminate nodes (that is, nodes that can be of tree or
hybrid type) by pentagons.




Fig. 1. The $R_{u,v;h_1,\ldots,h_k}$ reduction.

Fig. 2. The $T_{u,v;h_1,\ldots,h_k}$ reduction.



Now we have the following basic applicability result.

Proposition 1. Let $N$ be a galled network with a pair of different nodes with the same
nested label. Then, at least one R or T reduction can be applied to $N$, and the result is
a galled network.

Proof. Let $N$ be a galled network with a pair of nodes $u \neq v$ such that $\lambda_N(u) = \lambda_N(v)$.
Without any loss of generality, we assume that $v$ is a node of smallest height among
those nodes with the same nested label as some other node. By definition, $v$ cannot
be a leaf, because the only node with nested label $\{i\}$, with $i \in S$, is the leaf
labelled $i$. Therefore $v$ is internal: let $v_1,\ldots,v_k$ ($k \geq 1$) be its children, so that
$\lambda_N(v) = \{\lambda_N(v_1),\ldots,\lambda_N(v_k)\}$. Since $\lambda_N(u) = \lambda_N(v)$, $u$ has $k$ children, say
$u_1,\ldots,u_k$, and they are such that $\lambda_N(v_i) = \lambda_N(u_i)$ for every $i = 1,\ldots,k$. Then, since
$v_1,\ldots,v_k$ have smaller height than $v$ and by assumption $v$ is a node of smallest height
among those nodes with the same nested label as some other node, we deduce that
$v_i = u_i$ for every $i = 1,\ldots,k$. Therefore, $v_1,\ldots,v_k$ are hybrid, and their only parents (by
the semibinarity condition) are $u$ and $v$. Hence, we can apply the R reduction when $u$
and $v$ are siblings and the T reduction when they have different parents.

The fact that the result of the application of an R or a T reduction to $N$ is again a
galled network has been discussed in the definition of the reductions. □
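The argument of the proof suggests a simple search procedure, sketched below in Python. It is an illustration under stated assumptions rather than the authors' algorithm: the input is assumed to be a semibinary galled network given as (parent, child) arcs, nested labels are encoded as tuples sorted by `repr`, and the function name `find_reduction` and the sample networks are hypothetical.

```python
def find_reduction(arcs):
    """Return ("R" or "T", u, v) for a pair of smallest height with equal
    nested labels, or None if all nested labels are different."""
    children, parents = {}, {}
    for p, c in arcs:
        children.setdefault(p, []).append(c)
        children.setdefault(c, [])
        parents.setdefault(c, []).append(p)
        parents.setdefault(p, [])
    label, height = {}, {}

    def visit(v):
        if v in label:
            return
        for c in children[v]:
            visit(c)
        if children[v]:
            label[v] = tuple(sorted((label[c] for c in children[v]), key=repr))
            height[v] = 1 + max(height[c] for c in children[v])
        else:
            label[v] = (str(v),)
            height[v] = 0

    for v in children:
        visit(v)
    groups = {}
    for v in children:
        groups.setdefault(label[v], []).append(v)
    repeated = [vs for vs in groups.values() if len(vs) > 1]
    if not repeated:
        return None
    # nodes with equal nested labels have equal height, so vs[0] is representative
    u, v = min(repeated, key=lambda vs: height[vs[0]])[:2]
    kind = "R" if set(parents[u]) & set(parents[v]) else "T"
    return kind, u, v

net_r = [("x", "u"), ("x", "v"), ("u", "h1"), ("u", "h2"),
         ("v", "h1"), ("v", "h2"), ("h1", "1"), ("h2", "2")]
net_t = [("r", "x"), ("r", "y"), ("x", "u"), ("x", "3"), ("y", "v"), ("y", "4"),
         ("u", "h1"), ("u", "h2"), ("v", "h1"), ("v", "h2"), ("h1", "1"), ("h2", "2")]
```

On `net_r` the pair `u`, `v` shares the parent `x`, so the R reduction applies; on `net_t` the parents differ and the T reduction applies.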

We shall call the inverses of the R and T reductions, respectively, the $R^{-1}$ and
$T^{-1}$ expansions, and we shall denote them by $R^{-1}_{u,v;h_1,\ldots,h_k}$ and $T^{-1}_{u,v;h_1,\ldots,h_k}$. More
specifically, for every galled network $N$:

- if $N$ contains a tree node $x$ with tree children $w_1,\ldots,w_k$ such that they do not
have any descendant node in common, then the $R^{-1}_{u,v;h_1,\ldots,h_k}$ expansion can be applied
to $N$, and $R^{-1}_{u,v;h_1,\ldots,h_k}(N)$ is obtained by removing the arcs from $x$ to its children
$w_1,\ldots,w_k$, adding two tree nodes $u,v$ and $k$ hybrid nodes $h_1,\ldots,h_k$, together with
arcs from $x$ to $u$ and $v$, from $u$ and $v$ to every added hybrid node, and from $h_i$ to $w_i$
for every $i = 1,\ldots,k$;

- if $N$ contains a hybrid node $h$ whose only child $w$ has $k$ tree children $w_1,\ldots,w_k$
such that they do not have any descendant node in common, then the $T^{-1}_{u,v;h_1,\ldots,h_k}$
expansion can be applied to $N$, and $T^{-1}_{u,v;h_1,\ldots,h_k}(N)$ is obtained by removing the hybrid
node $h$ and its child $w$ together with their incoming and outgoing arcs and adding
two tree nodes $u,v$ and $k$ hybrid nodes $h_1,\ldots,h_k$ together with arcs from one parent
of $h$ to $u$ and from the other parent of $h$ to $v$, from $u$ and $v$ to every added hybrid
node, and from $h_i$ to $w_i$ for every $i = 1,\ldots,k$.

From these descriptions, since $w_1,\ldots,w_k$ do not have any descendant node in common,
we easily see that the result of an $R^{-1}$ or $T^{-1}$ expansion applied to a galled network
is always a galled network.

The following result is easily deduced from the explicit descriptions of the reduc- 
tions and expansions. 

Lemma 2. Let $N$ and $N'$ be two galled networks. If $N \cong N'$, then the results of applying
to both $N$ and $N'$ the same $R^{-1}$ expansion (respectively, $T^{-1}$ expansion) are again two
isomorphic galled networks.

Moreover, if we apply an R or T reduction to a galled network $N$, then we can apply
to the resulting network the corresponding inverse $R^{-1}$ or $T^{-1}$ expansion and the result
is a galled network isomorphic to $N$. □

So, by [4, Lem. 6], and since non-isomorphic galled networks without any pair of
different nodes with the same nested label always have different nested label
representations, to prove that Nakhleh's distance $m$ separates semibinary galled
networks, it is enough to prove that the possibility of applying a reduction to a
semibinary galled network $N$ can be decided from $\lambda(N)$, and that the nested label
representation of the result of the application of a reduction to a semibinary galled
network depends only on $\lambda(N)$ and the reduction. These two facts are given by the
following lemmas.

Lemma 3. Let $N$ be a galled network on a set $S$.

(1) If a reduction $R_{u,v;h_1,\ldots,h_k}$ can be applied to $N$, then the nodes $u,v$ involved in the
reduction satisfy the following property: $\lambda(u) = \lambda(v) = \{\{\lambda(w_1)\},\ldots,\{\lambda(w_k)\}\}$
and there is a node $x$ such that $\{\lambda(u),\lambda(v)\} \subseteq \lambda(x)$.

(2) Conversely, if two nodes $u,v$ satisfy the property above, and have minimal height
among those that satisfy it, then a reduction $R_{u,v;h_1,\ldots,h_k}$ can be applied to $N$.

(3) If $R_{u,v;h_1,\ldots,h_k}$ can be applied to $N$, let $N'$ be the resulting network. Then $\lambda(N')$ can
be computed from $\lambda(N)$ as follows:

- If $\lambda \in \lambda(N)$ does not contain any $\lambda(w_i)$ at any level of nesting, then $\lambda \in \lambda(N')$;

- If $\lambda \in \lambda(N)$ is equal to some $\lambda(w_i)$, then $\lambda \in \lambda(N')$;

- If $\lambda \in \lambda(N)$ contains at some level of nesting the element $\{\{\lambda_1\},\ldots,\{\lambda_k\}\}$
(with multiplicity 2), where $\lambda_i = \lambda(w_i)$, then replace it by the $k$ elements $\lambda_1,\ldots,\lambda_k$
(with multiplicity 1) to get $\lambda' \in \lambda(N')$.

Proof. If an R reduction can be applied to a galled network $N$, then it is clear that there
exist two sibling nodes $u$ and $v$, and hence a common parent $x$, such that $\lambda(u) = \lambda(v)$ and
$\lambda(x) \supseteq \{\lambda(u),\lambda(v)\}$. Since $u$ and $v$ have the same children, these nodes, say
$h_1,\ldots,h_k$, must be hybrid. For each $i = 1,\ldots,k$, let $w_i$ be the single child of $h_i$. Then it is
clear that $\lambda(u) = \lambda(v) = \{\{\lambda(w_1)\},\ldots,\{\lambda(w_k)\}\}$.

Conversely, assume that $\lambda(u) = \lambda(v) = \{\{\lambda(w_1)\},\ldots,\{\lambda(w_k)\}\}$. From the nested
label definition and the minimality assumption on the height, this implies that $u$ and $v$
have $k$ children which are hybrid nodes. Moreover, since there is a node $x$ such
that $\lambda(x) \supseteq \{\lambda(u),\lambda(v)\}$, we can conclude that $u$ and $v$ are sibling nodes and then we
can apply an R reduction to $N$.

Now, if $R_{u,v;h_1,\ldots,h_k}$ can be applied to $N$, then $N' = R_{u,v;h_1,\ldots,h_k}(N)$ is the galled
network obtained by removing the nodes $u,v,h_1,\ldots,h_k$, together with their incoming arcs,
and adding an arc from the parent of $u$ and $v$ to the child of each of $h_1,\ldots,h_k$. Thus, the
nested label of $w_i$ and the nested labels of all the descendants of $w_i$, for
every $i = 1,\ldots,k$, remain the same as in $N$. In the same way, the nested labels of all those
nodes that are not ancestors of $w_i$ for every $i = 1,\ldots,k$ remain the same as in $N$, and
then they are in the nested label representation of $N'$. Finally, the nested labels of the
ancestors of $w_i$ for every $i = 1,\ldots,k$ must be recomputed, since we have deleted two
intermediate nodes in every path from the ancestor to $w_i$. This implies that we delete
two levels of nesting, one tree node and $k$ hybrid nodes, and then we must replace
$\{\{\{\lambda_1\},\ldots,\{\lambda_k\}\},\{\{\lambda_1\},\ldots,\{\lambda_k\}\},\ldots\}$ by $\{\lambda_1,\ldots,\lambda_k,\ldots\}$. □
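The relabelling of part (3) can be sketched in Python directly on nested labels. This is a reading of the lemma under stated assumptions, not the authors' implementation: labels are encoded as tuples sorted by `repr`, the representation is a `Counter`, the labels of the removed nodes $u$, $v$, $h_1,\ldots,h_k$ are assumed to be dropped from the result, and all function names are hypothetical.

```python
from collections import Counter

def canon(items):
    """Canonical tuple encoding of a multiset of nested labels."""
    return tuple(sorted(items, key=repr))

def is_leaf_label(lab):
    return len(lab) == 1 and isinstance(lab[0], str)

def contains(lab, target):
    """True if target occurs as an element of lab at some level of nesting."""
    if is_leaf_label(lab):
        return False
    return any(c == target or contains(c, target) for c in lab)

def rewrite(lab, pair_label, lams):
    """Replace two copies of pair_label by the elements of lams, at every level."""
    if is_leaf_label(lab):
        return lab
    kids = [rewrite(c, pair_label, lams) for c in lab]
    if kids.count(pair_label) >= 2:
        kids.remove(pair_label)
        kids.remove(pair_label)
        kids.extend(lams)
    return canon(kids)

def reduce_r_labels(rep, lams):
    """Nested label representation after an R reduction, from rep and the labels of w_i."""
    lams = list(lams)
    pair_label = canon(canon([lam]) for lam in lams)   # common label of u and v
    out = Counter()
    for lab, mult in rep.items():
        if lab in lams or not any(contains(lab, lam) for lam in lams):
            out[lab] += mult                            # first two cases of (3)
        elif contains(lab, pair_label):
            out[rewrite(lab, pair_label, lams)] += mult  # third case of (3)
        # labels of u, v and h_1, ..., h_k match neither case and are dropped
    return out

# Hand-built representation of: r -> x, 3; x -> u, v; u, v -> h1, h2; h1 -> 1; h2 -> 2.
one, two, three = ("1",), ("2",), ("3",)
h1, h2 = canon([one]), canon([two])
uv = canon([h1, h2])
x_old = canon([uv, uv])
r_old = canon([x_old, three])
rep = Counter({one: 1, two: 1, three: 1, h1: 1, h2: 1, uv: 2, x_old: 1, r_old: 1})
reduced = reduce_r_labels(rep, [one, two])
```

In this example the result coincides with the nested label representation of the tree ((1,2),3), which is what the R reduction produces on the underlying network.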

Lemma 4. Let $N$ be a galled network on a set $S$.

(1) If a reduction $T_{u,v;h_1,\ldots,h_k}$ can be applied to $N$, then the nodes $u,v$ involved in the
reduction satisfy the following property: $\lambda(u) = \lambda(v) = \{\{\lambda(w_1)\},\ldots,\{\lambda(w_k)\}\}$
and there is no node $x$ such that $\{\lambda(u),\lambda(v)\} \subseteq \lambda(x)$.

(2) Conversely, if two nodes $u,v$ satisfy the property above, and have minimal height
among those that satisfy it, then a reduction $T_{u,v;h_1,\ldots,h_k}$ can be applied to $N$.

(3) If $T_{u,v;h_1,\ldots,h_k}$ can be applied to $N$, let $N'$ be the resulting network. Then $\lambda(N')$ can
be computed from $\lambda(N)$ as follows:

- If $\lambda \in \lambda(N)$ does not contain any $\lambda(w_i)$ at any level of nesting, then $\lambda \in \lambda(N')$;

- If $\lambda \in \lambda(N)$ is equal to some $\lambda(w_i)$, then $\lambda \in \lambda(N')$;

- If $\lambda \in \lambda(N)$ contains at some level of nesting the element $\{\{\lambda_1\},\ldots,\{\lambda_k\}\}$,
where $\lambda_i = \lambda(w_i)$, then replace it by the element $\{\{\lambda_1,\ldots,\lambda_k\}\}$ (with the
same multiplicity) to get $\lambda' \in \lambda(N')$.

- Include also $\{\lambda_1,\ldots,\lambda_k\}$ and $\{\{\lambda_1,\ldots,\lambda_k\}\}$ in $\lambda(N')$.

Proof. The proof of this lemma goes the same way as the previous one, taking now into 
account how the nested labels are modified. □ 

As a consequence, we obtain the desired result.

Theorem 1. The distance m defined above is a metric on the space of galled networks 
on a fixed set of labels. □ 